On the Complexity of Query Answering under 
Matching Dependencies for Entity Resolution 

Leopoldo Bertossi
Carleton University, SCS, Ottawa, Canada

Jaffer Gardezi
University of Ottawa, SITE, Ottawa, Canada

Abstract. Matching Dependencies (MDs) are a relatively recent proposal for declarative
entity resolution. They are rules that specify, given the similarities satisfied by values
in a database, what values should be considered duplicates, and have to be matched. On the
basis of a chase-like procedure for MD enforcement, we can obtain clean (duplicate-free)
instances; actually possibly several of them. The resolved answers to queries are those
that are invariant under the resulting class of resolved instances. In previous work we
identified some tractable cases (i.e. for certain classes of queries and MDs) of resolved
query answering. In this paper we further investigate the complexity of this problem,
identifying some intractable cases. For a special case we obtain a dichotomy complexity
result.

1 Introduction

A database may contain several representations of the same external entity. In this sense
it contains "duplicates", which is in general considered to be undesirable, and the
database has to be cleaned. More precisely, the problem of duplicate- or entity-resolution
(ER) is about (a) detecting duplicates, and (b) merging duplicate representations into
single representations. This is a classic and complex problem in data management, and in
data cleaning in particular [9, 11, 4]. In this work we concentrate on the merging part of
the problem, in a relational context.

A generic way to approach the problem consists in specifying what attribute values have to
be matched (made identical) under what conditions. A declarative language with a precise
semantics could be used for this purpose. In this direction, matching dependencies (MDs)
have been recently introduced [12]. They represent rules for resolving pairs of duplicate
representations (considering two tuples at a time). Actually, when certain similarity
relationships between attribute values hold, an MD indicates what attribute values have to
be made the same (matched).

Example 1. The similarities of phone and address indicate that the tuples refer to the
same person, and the names should be matched. Here, 723-9583 $\approx$ (750) 723-9583 and
10-43 Oak St. $\approx$ 43 Oak St. Ap. 10.

People (P)
Name         Phone            Address
-----------  ---------------  -----------------
John Smith   723-9583         10-43 Oak St.
J. Smith     (750) 723-9583   43 Oak St. Ap. 10

An MD capturing this cleaning policy could be the following:

$P[Phone] \approx P[Phone] \wedge P[Address] \approx P[Address] \rightarrow P[Name] \doteq P[Name]$.

This MD involves only one database predicate, but in general, an MD may involve two
different relations. □
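
As a rough illustration of how the MD of Example 1 selects the values to be matched, the following sketch checks the similarity conditions on the two People tuples. The functions `phone_sim` and `address_sim` are hypothetical stand-ins for the built-in similarity predicates (the paper does not define them); they are chosen only so that the two similarities stated in the example hold.

```python
import re

def digits(s):
    """Keep only the digits of a string."""
    return re.sub(r"\D", "", s)

def phone_sim(p, q):
    # Hypothetical similarity: one number is a suffix of the other,
    # so a missing area code is tolerated.
    a, b = digits(p), digits(q)
    return a.endswith(b) or b.endswith(a)

def address_sim(a, b):
    # Hypothetical similarity: same set of alphanumeric tokens, ignoring "ap".
    def tokens(s):
        return {t for t in re.findall(r"[a-z0-9]+", s.lower()) if t != "ap"}
    return tokens(a) == tokens(b)

people = [
    {"Name": "John Smith", "Phone": "723-9583", "Address": "10-43 Oak St."},
    {"Name": "J. Smith", "Phone": "(750) 723-9583", "Address": "43 Oak St. Ap. 10"},
]

def names_to_match(rows):
    """Index pairs of tuples whose Name values the MD requires to be matched."""
    pairs = []
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            if (phone_sim(rows[i]["Phone"], rows[j]["Phone"])
                    and address_sim(rows[i]["Address"], rows[j]["Address"])):
                pairs.append((i, j))
    return pairs

matches = names_to_match(people)
```

Here `matches` lists the tuple pairs whose names must be made equal; the MD itself does not say which common value to use.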

Here we report on new results (in Section 4) on the computation of resolved query answers
wrt. a set of MDs, i.e. of those answers that are invariant under the MD-based ER process.
We identify syntactic classes of MDs for which computing resolved answers to conjunctive
queries in a given syntactic class is always intractable.



2 Preliminaries 

We assume we are dealing with relational schemas and instances. Matching dependencies
(MDs) are symbolic rules of the form:

$$\bigwedge_{i,j} R[A_i] \approx_{ij} S[B_j] \;\rightarrow\; \bigwedge_{k,l} R[A_k] \doteq S[B_l], \qquad (1)$$

where R, S are relational predicates, and the $A_i, \ldots$ are attributes for them. The LHS
captures similarity conditions on a pair of tuples belonging to the extensions of R and
S in an instance D. We abbreviate this formula as
$R[\bar{A}] \approx S[\bar{B}] \rightarrow R[\bar{C}] \doteq S[\bar{E}]$.
MDs have a dynamic interpretation requiring that those values on the RHS should be
updated to some (unspecified) common value. The attributes on the RHS of an MD are
called changeable attributes.

The similarity predicates $\approx$ (there may be more than one in an MD, depending on
the attributes involved) are treated here as built-ins, but are assumed to satisfy:
(a) symmetry: if $x \approx y$, then $y \approx x$; and (b) equality subsumption: if
$x = y$, then $x \approx y$. However, transitivity is not assumed (and in some
applications it may not hold).
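
A small sketch may help to see why transitivity cannot be taken for granted. The threshold similarity `sim` below is a hypothetical example (not taken from the paper): it is symmetric and subsumes equality, yet it is not transitive.

```python
def sim(x, y, tol=1):
    # Hypothetical similarity on integers: values at distance at most `tol`.
    return abs(x - y) <= tol

values = range(5)

# (a) symmetry
symmetric = all(sim(x, y) == sim(y, x) for x in values for y in values)

# (b) equality subsumption
subsumes_equality = all(sim(x, x) for x in values)

# Transitivity fails: 0 ~ 1 and 1 ~ 2 hold, but 0 ~ 2 does not.
transitive = all(
    sim(x, z)
    for x in values for y in values for z in values
    if sim(x, y) and sim(y, z)
)
```

Edit-distance thresholds on strings behave in the same way, which is one reason the assumptions stop at (a) and (b).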

MDs are to be "applied" iteratively until duplicates are resolved. In order to keep
track of the changes and to compare tuples and instances, we use global tuple
identifiers, a non-changeable surrogate key for each database predicate that has changeable
attributes. The auxiliary, extra attribute (when shown) appears as the first attribute in a
relation, e.g. t is the identifier in $R(t, \bar{x})$. A position is a pair (t, A) with t a
tuple id, and A an attribute (of the relation where t is an id). The position's value,
t[A], is the value for A in tuple (with id) t.

A semantics for MDs acting on database instances was proposed in [13]. It is based
on a chase procedure that is iteratively applied to the original instance D. A resolved
instance D' is obtained from a finitely terminating sequence of instances, say

$$D \mapsto D_1 \mapsto D_2 \mapsto \cdots \mapsto D', \qquad (2)$$

terminating in D', which satisfies the MDs as equality generating dependencies [1], i.e.
replacing $\doteq$ by equality.
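
The termination condition can be sketched as a check that an instance satisfies every MD read as an equality generating dependency. This is only the final-state test, not the chase itself. The encoding is an assumption made for illustration: an instance is a dictionary from tuple ids to rows, each MD is over a single relation with the same attribute lists on both sides of each comparison, and the sample instance is hypothetical.

```python
from itertools import combinations

def satisfies_as_egd(instance, mds, sim):
    """True iff every MD holds when its RHS is read as plain equality.

    instance: {tuple_id: {attribute: value}}
    mds: list of (lhs_attrs, rhs_attrs), one pair per MD over a single relation
    sim: similarity predicate on values (assumed symmetric)
    """
    for lhs, rhs in mds:
        for t1, t2 in combinations(instance, 2):
            r1, r2 = instance[t1], instance[t2]
            if all(sim(r1[a], r2[a]) for a in lhs):
                if any(r1[b] != r2[b] for b in rhs):
                    return False
    return True

# Hypothetical instance for one MD with A on the LHS and B on the RHS.
equal = lambda x, y: x == y
mds = [(["A"], ["B"])]
before = {1: {"A": "a", "B": "b1"}, 2: {"A": "a", "B": "b2"}}
after = {1: {"A": "a", "B": "d"}, 2: {"A": "a", "B": "d"}}

before_ok = satisfies_as_egd(before, mds, equal)
after_ok = satisfies_as_egd(after, mds, equal)
```

In this toy case `before` still violates the dependency, while `after` (with an arbitrary common value `d`) satisfies it.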

The semantics specifies the one-step transitions or updates allowed to go from $D_{i-1}$
to $D_i$, i.e. "$\mapsto$" in (2). Only modifiable positions within the instance are allowed
to change their values in such a step, and as forced by the MDs. Actually, the modifiable
positions syntactically depend on a whole set M of MDs and the instance at hand, and
can be recursively defined (see [13, 14] for the details). Intuitively, a position (t, A) is
modifiable iff: (a) There is a t' such that t and t' satisfy the similarity condition of an
MD with A on the RHS; or (b) t[A] has not already been resolved (it is different from
one of its other duplicates).

Example 2. Consider the MD $R[A] \approx R[A] \rightarrow R[B] \doteq R[B]$, and the
instance R(D) below. The positions of the underlined values in D are modifiable, because
their values are unresolved (wrt the MD).

D' is a resolved instance since it satisfies the MD interpreted as an FD (the update
value d is arbitrary).

D' has no modifiable positions with unresolved values: the values for B are already the
same, so there is no reason to change them. □

More formally, the single-step semantics is as follows. Each pair $D_i \mapsto D_{i+1}$ in
an update sequence (2), i.e. a chase step, must satisfy the set M of MDs, modulo
unmodifiability, denoted $(D_i, D_{i+1}) \models_{um} M$, which holds iff: (a) For every MD,
say $R[\bar{A}] \approx S[\bar{B}] \rightarrow R[\bar{C}] \doteq S[\bar{D}]$, and pair of
tuples $t_R$ and $t_S$, if $t_R[\bar{A}] \approx t_S[\bar{B}]$ in $D_i$, then
$t_R[\bar{C}] = t_S[\bar{D}]$ in $D_{i+1}$; and (b) The value of a position can only differ
between $D_i$ and $D_{i+1}$ if it is modifiable wrt $D_i$.

This semantics stays as close as possible to the spirit of the MDs as originally
introduced [12], and is also uncommitted in the sense that the MDs do not specify how the
matchings have to be realized.¹

Example 3. Consider the following instance and set of MDs:

$R[A] \approx R[A] \rightarrow R[B] \doteq R[B]$
$R[B] \approx R[B] \rightarrow R[C] \doteq R[C]$.

Here, attribute R(C) is changeable. Position $(t_2, C)$ is not modifiable wrt. M and D:
There is no justification to change its value in one step on the basis of an MD and D.
However, position $(t_1, C)$ is modifiable. We obtain two resolved instances for D: $D_1$
and $D_2$ below.

$D_1$ cannot be obtained in a single (one-step) update since the underlined value is
for a non-modifiable position. However, $D_2$ can. □

Among the resolved instances we prefer those that are closest to the original instance.
Accordingly, a minimally resolved instance (MRI) of D is a resolved instance D' such
that the number of changes of attribute values comparing D with D' is a minimum.
In Example 3, instance $D_2$ is an MRI, but not $D_1$ (2 vs. 3 changes). We denote with
$Res(D, M)$ and $MinRes(D, M)$ the classes of resolved, resp. minimally resolved,
instances of D wrt M.
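
The minimality criterion can be sketched as follows, assuming the resolved instances have already been computed by some other means. The dictionary encoding of instances and the sample data are hypothetical; only the counting of changed positions follows the definition above.

```python
def num_changes(original, resolved):
    """Number of positions (tuple id, attribute) whose value differs."""
    return sum(
        1
        for tid, row in original.items()
        for attr, value in row.items()
        if resolved[tid][attr] != value
    )

def minimally_resolved(original, resolved_instances):
    """Keep the resolved instances with the fewest value changes."""
    if not resolved_instances:
        return []
    costs = [num_changes(original, r) for r in resolved_instances]
    best = min(costs)
    return [r for r, c in zip(resolved_instances, costs) if c == best]

# Hypothetical instance and two candidate resolved instances.
d = {1: {"A": "a", "B": "b"}, 2: {"A": "a", "B": "c"}}
r1 = {1: {"A": "a", "B": "b"}, 2: {"A": "a", "B": "b"}}  # one change
r2 = {1: {"A": "a", "B": "d"}, 2: {"A": "a", "B": "d"}}  # two changes

mris = minimally_resolved(d, [r1, r2])
```

The sketch relies on tuple identifiers being stable across instances, which is what the surrogate keys introduced earlier provide.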

Given a conjunctive query Q, a set of MDs M, and an instance D, the resolved answers
to Q from D are those that are invariant under the entity resolution process, i.e.
they are answers to Q that are true in all MRIs of D:
$ResAns_M(Q, D) := \{\bar{c} \mid D' \models Q[\bar{c}], \text{ for every } D' \in MinRes(D, M)\}$.
We denote with $RA(Q, M)$ the decision problem
$\{(D, \bar{c}) \mid \bar{c} \in ResAns_M(Q, D)\}$.
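
Read operationally, the resolved answers are the intersection of the answer sets over all MRIs. The sketch below assumes the MRIs are given explicitly (which in general is exactly the expensive part) and represents a query as a Python function from an instance to a set of answer tuples; both choices, and the sample data, are illustrative assumptions.

```python
def resolved_answers(query, mris):
    """Answers to `query` that hold in every minimally resolved instance."""
    answer_sets = [set(query(instance)) for instance in mris]
    if not answer_sets:
        return set()
    return set.intersection(*answer_sets)

# Hypothetical MRIs of a relation R(A, B, C), given as lists of tuples.
mri_1 = [("a1", "b1", "c1"), ("a2", "b1", "c1")]
mri_2 = [("a1", "b1", "c1"), ("a2", "b1", "c2")]

# Q(x, z): exists y R(x, y, z), i.e. a projection on the first and third columns.
def q(instance):
    return {(x, z) for (x, _y, z) in instance}

answers = resolved_answers(q, [mri_1, mri_2])
```

Only `("a1", "c1")` survives, since the two MRIs disagree on the C value of the second tuple.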

The definition of resolved answer is reminiscent of that of consistent query answers 
(CQA) in databases that may not satisfy given integrity constraints (ICs) [2, 5]. Much 
research in CQA has been about developing (polynomial-time) query rewriting method- 
ologies. The idea is to rewrite a query, say conjunctive, into a new query such that the 
new query on the inconsistent database returns as usual answers the consistent answers 
to the original query. In all the cases identified in the literature on CQA (see [6] for 
a survey, and [17] for recent results) depending on the class of conjunctive query and 
ICs involved, the rewritings that produce polynomial time CQA have been first-order. 

¹ We have proposed and investigated other semantics. One of them is as above, but with
modified chase conditions, e.g. applying one MD at a time. Another one imposes that
previous resolutions cannot be unresolved. In [7, 8, 3] a semantics that uses matching
functions to choose a value for a match is developed.



Doing something similar for resolved query answering (RQA) under MDs brings new
challenges: (a) MDs contain the non-transitive similarity predicates. (b) Enforcing
consistency of updates requires computing the transitive closure of such operators. (c) The
minimality of value changes (which is not always used in CQA or considered for consistent
rewritings). (d) The semantics of resolved query answering for MD-based entity
resolution is given, in the end, in terms of a chase procedure.² However, the semantics
of CQA is model-theoretic, given in terms of repairs that are not operationally defined,
but arise from set-theoretic conditions.³

3 Tractability and Datalog Query Rewriting 

In [14, 15], a query rewriting methodology for RQA under MDs was presented. In this
case, the rewritten queries turn out to be Datalog queries with counting, and can be
obtained for two main classes of sets of MDs: (a) MDs do not depend on each other,
i.e. non-interacting sets of MDs [13]; (b) MDs depend cyclically on each other, e.g. a
set containing $R[A] \approx R[A] \rightarrow R[B] \doteq R[B]$ and
$R[B] \approx R[B] \rightarrow R[A] \doteq R[A]$ (or relationships like this by
transitivity).

Here cycles help us, because the termination condition for the chase imposes a simple
form on the minimally resolved instances (easier to capture and characterize) [14].
For these sets of MDs a conjunctive query can be rewritten to retrieve, in polynomial
time, the resolved answers, provided there are no joins on existentially quantified
variables corresponding to changeable attributes; these are the unchangeable attribute
join conjunctive (UJCQ) queries [15]. For example, for the MD
$R[A] \approx R[A] \rightarrow R[B, C] \doteq R[B, C]$ on schema R[A, B, C],
$Q: \exists x \exists y \exists z (R(x, y, c) \wedge R(z, y, d))$ is not UJCQ; whereas
$Q': \exists x \exists z (R(x, y, z) \wedge R(x, y', z'))$ is UJCQ. For queries outside
UJCQ, the resolved answer problem can be intractable even for one MD [15].
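
One possible reading of the UJCQ condition, sketched below, is that no existentially quantified variable may occur more than once if one of its occurrences is at a changeable attribute. This reading, the encoding of queries as lists of atoms, and the names `is_ujcq`, `existential` and `changeable` are assumptions made for illustration; the precise definition is the one in [15].

```python
def is_ujcq(atoms, existential, changeable):
    """Check the 'no join on changeable attributes' condition.

    atoms: list of (predicate, tuple_of_terms)
    existential: set of existentially quantified variable names
    changeable: set of (predicate, position_index) for changeable attributes
    """
    for var in existential:
        occurrences = [
            (pred, i)
            for pred, terms in atoms
            for i, term in enumerate(terms)
            if term == var
        ]
        # A variable occurring more than once is a join variable.
        if len(occurrences) > 1 and any(o in changeable for o in occurrences):
            return False
    return True

# B and C (positions 1 and 2 of R) are changeable for the MD in the text.
changeable = {("R", 1), ("R", 2)}

# Q: y is existential and joins the two atoms on the changeable attribute B.
q = [("R", ("x", "y", "c")), ("R", ("z", "y", "d"))]
q_is_ujcq = is_ujcq(q, {"x", "y", "z"}, changeable)

# Q': the only join is on x, at the unchangeable attribute A.
q_prime = [("R", ("x", "y", "z")), ("R", ("x", "y2", "z2"))]
q_prime_is_ujcq = is_ujcq(q_prime, {"x", "z"}, changeable)
```

Under this reading the check classifies the two example queries as the text does.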

The case of a set of MDs consisting of

$$R[A] \approx R[A] \rightarrow R[B] \doteq R[B] \quad \text{and} \quad R[B] \approx R[B] \rightarrow R[C] \doteq R[C], \qquad (3)$$

which is neither non-interacting nor cyclic, is not covered by the positive cases for
Datalog rewriting above. Actually, for this set RQA becomes intractable for very simple
queries, like $Q(x, z): \exists y R(x, y, z)$, which is UJCQ [13].

4 Intractability of Computing Resolved Query Answers 

In the previous section we briefly described classes of queries and MDs for which RQA
can be done in polynomial time in data (via the Datalog rewriting). We also showed that
there are intractable cases, by pointing to a specific query and set of MDs. The questions
that naturally arise are: (a) What happens outside the Datalog rewritable cases in terms
of complexity of RQA? (b) Do the exhibited query and MDs correspond to a more general
pattern for which intractability holds? We address these questions here.

² For some implicit connections between repairs and chase procedures, e.g. as used in data
exchange, see [16], and as used under database completion with ICs, see [10].

³ For additional discussions of differences and connections between CQA and resolved query
answering see [13, 15].

For all sets M of MDs we consider below, at most two relational predicates appear
in M, and when there are two predicates, both appear in all MDs in M. According to
the syntactic restrictions for MDs in (1), those two predicates occur in all conjuncts
of an MD in M. Furthermore, all the sets of MDs considered below will turn out to
be, as previously announced, both interacting and acyclic. Both notions and others can
be captured in terms of the MD graph, $MDG(M)$, a directed graph such that, for
$m_1, m_2 \in M$, there is an edge from $m_1$ to $m_2$ if there is an overlap between
$RHS(m_1)$ and $LHS(m_2)$ (the right- and left-hand sides of the arrows as sets of
attributes) [13]. M is acyclic when $MDG(M)$ is acyclic. Our results require several terms
and notation that we now define.
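
The MD graph and the acyclicity test can be sketched directly from this description. The encoding is an assumption: each MD is a dictionary with the attribute sets of its two sides, attributes are written as strings such as `"R.A"`, and edges are only drawn between distinct MDs (the text does not say how self-overlaps are treated).

```python
def md_graph(mds):
    """Edges (i, j): the RHS of mds[i] overlaps the LHS of mds[j], for i != j."""
    return {
        (i, j)
        for i, mi in enumerate(mds)
        for j, mj in enumerate(mds)
        if i != j and mi["rhs"] & mj["lhs"]
    }

def is_acyclic(n, edges):
    """Kahn-style test: repeatedly remove vertices with no incoming edge."""
    remaining = set(range(n))
    edges = set(edges)
    while remaining:
        sources = {v for v in remaining if not any(j == v for (_, j) in edges)}
        if not sources:
            return False  # every remaining vertex lies on or behind a cycle
        remaining -= sources
        edges = {(i, j) for (i, j) in edges if i not in sources}
    return True

# The set (3): an interacting but acyclic pair.
chain = [
    {"lhs": {"R.A"}, "rhs": {"R.B"}},
    {"lhs": {"R.B"}, "rhs": {"R.C"}},
]
# The cyclic pair mentioned in Section 3.
cyclic = [
    {"lhs": {"R.A"}, "rhs": {"R.B"}},
    {"lhs": {"R.B"}, "rhs": {"R.A"}},
]

chain_edges = md_graph(chain)
chain_acyclic = is_acyclic(len(chain), chain_edges)
cyclic_acyclic = is_acyclic(len(cyclic), md_graph(cyclic))
```

A set with an empty edge set is non-interacting; the set (3) has exactly one edge and no cycle, so it falls outside both rewritable classes.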

Definition 1. Consider a set M of MDs involving the predicates R and S. A changeable
attribute query Q is a (conjunctive) query in UJCQ, containing a conjunct of the
form $R(\bar{x})$ or $S(\bar{y})$ with all variables free. Such a conjunct is called a free
occurrence of the predicate R or S. □

By definition, the class of changeable attribute queries (CHAQ) is a subclass of UJCQ.
Both classes depend on the set of MDs at hand. For example, for the MDs in (3),
$\exists y R(x, y, z) \in UJCQ \setminus CHAQ$, but
$\exists w \exists t (R(x, y, z) \wedge S(x, w, t)) \in CHAQ$. We confine attention to UJCQ
and subsets of it because, as mentioned in the previous section, intractability limits the
applicability of the duplicate resolution method for queries outside UJCQ. The requirement
that the query contains a free occurrence of R or S eliminates from consideration certain
queries in UJCQ for which the resolved answer problem is trivially tractable. For example,
for MDs in (3), the query $\exists y \exists z R(x, y, z)$ is not in CHAQ, but is tractable
simply because it does not return the values of a changeable attribute (the resolved
answers are the answers in the usual sense).

Definition 2. A set M of MDs is hard if for every CHAQ Q, $RA(Q, M)$ is NP-hard.
M is easy if for every CHAQ Q, $RA(Q, M)$ is in PTIME. □

Of course, a set of MDs may be neither hard nor easy. In the following we give some
syntactic conditions that guarantee hardness for classes of MDs.

Definition 3. Let m be an MD. The symmetric binary relation $LRel(m)$ ($RRel(m)$)
relates each pair of attributes A and B such that a conjunct of the form
$R[A] \approx S[B]$ (resp. $R[A] \doteq S[B]$) appears in $LHS(m)$ (resp. $RHS(m)$). An
L-component (R-component) of m is an equivalence class of the reflexive and transitive
closure, $LRel(m)^{=+}$ (resp. $RRel(m)^{=+}$), of $LRel(m)$ (resp. $RRel(m)$). □
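
Since the relation is symmetric, the equivalence classes of its reflexive and transitive closure are the connected components of the graph that has one edge per conjunct. A minimal sketch, assuming one side of an MD is given as a list of attribute pairs and attributes are named by hypothetical strings such as `"R.A"`:

```python
def components(pairs):
    """Connected components of the attribute graph with one edge per conjunct."""
    comps = []
    for a, b in pairs:
        merged = {a, b}
        rest = []
        for c in comps:
            if c & merged:
                merged |= c  # the new edge connects to this component
            else:
                rest.append(c)
        comps = rest + [merged]
    return comps

# Hypothetical LHS with conjuncts R[A]~S[B], R[A]~S[C] and R[D]~S[E].
lhs = [("R.A", "S.B"), ("R.A", "S.C"), ("R.D", "S.E")]
l_components = sorted(sorted(c) for c in components(lhs))
```

The same function applied to the conjuncts of the RHS yields the R-components.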

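The first results concern linear pairs of MDs, i.e. those whose graph $MDG(M)$
consists of the vertices $m_1$ and $m_2$, say

$$m_1: R[\bar{A}] \approx_1 S[\bar{B}] \rightarrow R[\bar{C}] \doteq S[\bar{E}], \quad \text{and} \quad m_2: R[\bar{F}] \approx_2 S[\bar{G}] \rightarrow R[\bar{H}] \doteq S[\bar{I}], \qquad (4)$$

with only an edge from $m_1$ to $m_2$, i.e.
$(R[\bar{C}] \cup S[\bar{E}]) \cap (R[\bar{F}] \cup S[\bar{G}]) \neq \emptyset$, whereas
$(R[\bar{H}] \cup S[\bar{I}]) \cap (R[\bar{A}] \cup S[\bar{B}]) = \emptyset$. The linear pair
is denoted by $(m_1, m_2)$.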

Definition 4. Let $(m_1, m_2)$ be a linear pair as in (4). (a) $B_R$ is a binary (reflexive
and symmetric) relation on attributes of R: $(R[U_1], R[U_2]) \in B_R$ iff $R[U_1]$ and
$R[U_2]$ are in the same R-component of $m_1$ or the same L-component of $m_2$. Similarly
for $B_S$.

(b) An R-equivalent set (R-ES) of attributes of $(m_1, m_2)$ is an equivalence class of
$TC(B_R)$, the transitive closure of $B_R$, with at least one attribute in the equivalence
class belonging to $LHS(m_2)$. The definition of an S-equivalent set (S-ES) is the same,
with R replaced by S.

(c) An (R or S)-ES E of $(m_1, m_2)$ is bound if $E \cap LHS(m_1)$ is non-empty. □
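
Definition 4 can be sketched as two rounds of merging overlapping sets: first to obtain the R-components of $m_1$ and the L-components of $m_2$, then to close the induced relation on the attributes of one predicate transitively. The encoding below is an assumption for illustration: each side of an MD is a list of attribute pairs, and attribute names carry their predicate as a prefix (`"R."` or `"S."`).

```python
def merge_overlapping(sets):
    """Merge sets that share an element until all are pairwise disjoint."""
    merged = []
    for s in sets:
        s = set(s)
        rest = []
        for m in merged:
            if m & s:
                s |= m
            else:
                rest.append(m)
        merged = rest + [s]
    return merged

def equivalent_sets(m1, m2, pred):
    """ESs of the linear pair (m1, m2) for predicate `pred` ('R' or 'S')."""
    # R-components of m1 and L-components of m2, over all attributes.
    blocks = merge_overlapping(m1["rhs"]) + merge_overlapping(m2["lhs"])
    # Restrict to attributes of `pred`, then take the transitive closure.
    restricted = [{a for a in b if a.startswith(pred + ".")} for b in blocks]
    classes = merge_overlapping([b for b in restricted if b])
    lhs2 = {a for pair in m2["lhs"] for a in pair}
    return [c for c in classes if c & lhs2]

def is_bound(es, m1):
    """An ES is bound if it contains an attribute of LHS(m1)."""
    lhs1 = {a for pair in m1["lhs"] for a in pair}
    return bool(es & lhs1)

# Hypothetical linear pair in the shape of (4), with single attributes.
m1 = {"lhs": [("R.A", "S.B")], "rhs": [("R.C", "S.E")]}
m2 = {"lhs": [("R.C", "S.E")], "rhs": [("R.H", "S.I")]}

r_ess = equivalent_sets(m1, m2, "R")
r_bound = [is_bound(e, m1) for e in r_ess]
```

For this pair the only R-ES is `{"R.C"}`, and it is not bound because `R.C` does not occur in the LHS of `m1`.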



Theorem 1. Let $(m_1, m_2)$ be a linear pair as in (4), with R and S distinct predicates.
Assume that each similarity relation has an infinite set of mutually dissimilar elements.
Let $E_R$ and $E_S$ be the classes of R-ESs and S-ESs, resp. The pair $(m_1, m_2)$ is hard if
$RHS(m_1) \cap RHS(m_2) = \emptyset$, and at least one of the following does not hold:

(a) At least one of the following is true: (i) there are no attributes of R in
$RHS(m_1) \cap LHS(m_2)$; (ii) all ESs in $E_R$ are bound; or (iii) for each L-component L
of $m_1$, there is an attribute of R in $L \cap LHS(m_2)$.

(b) At least one of the following is true: (i) there are no attributes of S in
$RHS(m_1) \cap LHS(m_2)$; (ii) all ESs in $E_S$ are bound; or (iii) for each L-component L
of $m_1$, there is an attribute of S in $L \cap LHS(m_2)$. □

Theorem 1 says that a linear pair of MDs is hard unless the syntactic form of the MDs is
such that there is a certain association between changeable attributes in $LHS(m_2)$ and
attributes in $LHS(m_1)$ as specified by conditions (ii) and (iii). When $m_1$ is applied to
an instance, similarities can be produced among the values of attributes of $RHS(m_1)$
which are not required by the chase but result from a particular choice of update values.
Such accidental similarities affect the subsequent updates made by applying $m_2$, making
the query answering problem intractable [13]. For pairs of MDs satisfying (a)(ii) or
(a)(iii) (or (b)(ii) or (b)(iii)) in Theorem 1, the similarities resulting from applying
$m_2$ are restricted to a subset of those that are already present among the values of
attributes in $LHS(m_1)$, making the problem tractable.

However, when condition (ii) or (iii) is satisfied, accidental similarities among the
values of attributes in $RHS(m_1)$ cannot be passed on to values of attributes in
$RHS(m_2)$.

This result gives a syntactic condition for hardness. It is an important result, because
it applies to many cases of practical interest. For example, the linear pair $(m_1, m_2)$ in
(3) turns out to be hard (for all CHAQ queries, in addition to $\exists y R(x, y, z)$).

All syntactic conditions/constructs on attributes above, in particular, the transitive 
closures on attributes, are "orthogonal" to semantic properties of the similarity relations. 
When similarity predicates are transitive, every linear pair not satisfying the hardness 
criteria of Theorem 1 is easy. 

Theorem 2. (dichotomy for transitive similarity) Let $(m_1, m_2)$ be a linear pair with
$RHS(m_1) \cap RHS(m_2) = \emptyset$. If the similarity operators are transitive, then
$(m_1, m_2)$ is either easy or hard. □

The next result concerns pair-preserving acyclic sets of MDs, defined by: M is
pair-preserving if, for any attribute R[A] occurring in an MD, there is only one attribute
S[B] such that $R[A] \approx S[B]$ or $R[A] \doteq S[B]$ occurs in an MD. These sets of MDs
can be of arbitrary size (still subject to the condition of containing at most two
predicates). The pair-preserving assumption typically holds in a duplicate resolution
setting, since the values of pairs of attributes are normally compared only if they hold
the same type of information (e.g. they are both addresses or both names).
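
The pair-preserving condition is easy to test mechanically. In the sketch below each side of an MD is a list of attribute pairs; treating the condition symmetrically (each attribute of either predicate has a single partner) is an assumption of this illustration, as are the attribute names.

```python
from collections import defaultdict

def is_pair_preserving(mds):
    """True iff every attribute is compared or matched with exactly one partner."""
    partners = defaultdict(set)
    for md in mds:
        for a, b in md["lhs"] + md["rhs"]:
            partners[a].add(b)
            partners[b].add(a)
    return all(len(p) == 1 for p in partners.values())

# Hypothetical sets of MDs over predicates R and S.
good = [
    {"lhs": [("R.A", "S.B")], "rhs": [("R.C", "S.E")]},
    {"lhs": [("R.C", "S.E")], "rhs": [("R.H", "S.I")]},
]
bad = [
    {"lhs": [("R.A", "S.B"), ("R.A", "S.C")], "rhs": [("R.D", "S.E")]},
]

good_ok = is_pair_preserving(good)
bad_ok = is_pair_preserving(bad)
```

In `bad`, the attribute `R.A` is compared with both `S.B` and `S.C`, which violates the condition.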

Definition 5. Let M be pair-preserving and acyclic, B an attribute in M, and
$M' \subseteq M$. B is non-inclusive wrt. M' if, for every $m \in M \setminus M'$ with
$B \in RHS(m)$, there is an attribute C such that: (a) $C \in LHS(m)$,
(b) $C \notin \bigcup_{m' \in M'} LHS(m')$, and (c) C is non-inclusive wrt. M'. □

This is a recursive definition of non-inclusiveness. The base case occurs when C is
not in $RHS(m)$ for any m, and so must be inclusive (i.e. not non-inclusive). Because
$C \in LHS(m)$ in the definition, for any $m_1$ such that $C \in RHS(m_1)$, there is an edge
from $m_1$ to m. Therefore, we are traversing an edge backwards with each recursive
step, and the recursion terminates by the acyclicity assumption.

Non-inclusiveness is a generalization of conditions (a)(iii) and (b)(iii) in Theorem
1 to a set of arbitrarily many MDs. It expresses a condition of inclusion of attributes in
the left-hand side of one MD in the left-hand side of another. Theorem 3 tells us that
a set of MDs that is non-inclusive in this sense is hard. Notice that the condition of
Theorem 1 that there exists an ES that is not bound does not appear in Theorem 3. This
is because, by the pair-preserving requirement, there cannot be a bound ES for any pair
of MDs in the set. For linear pairs, Theorem 3 becomes Theorem 1.

Theorem 3. Let M be pair-preserving and acyclic. Assume there is $\{m_1, m_2\} \subseteq M$,
and attributes $C \in RHS(m_2)$, $B \in RHS(m_1) \cap LHS(m_2)$ with: (a) C is
non-inclusive wrt $\{m_1, m_2\}$, and (b) B is non-inclusive wrt $\{m_2\}$. Then, M is hard. □